At the time that Watson and Crick proposed a structure
for DNA, a visionary might have suggested that the
complete genetic sequence of an organism would eventually
be known. However, nobody could have realistically
proposed that machines could automatically indicate gene
functions. Yet precisely this has been achieved: with no
laboratory experiments at all, the roles of most genes in
several organisms have been reported.

But how reliable are these functional annotations, on
which we depend for understanding genomes?
Without laboratory experiments to verify the computational
methods and their expert analysis, it is impossible
to know for certain. However, a simple comparison can
place a rough upper bound on their accuracy. I have
compared three different groups' functional annotations 1-3 for
the Mycoplasma genitalium genome 1 (Fig. 1). Where two
groups' descriptions are completely incompatible, at least
one must be in error. [...]
for vague or absent functional assignment. Furthermore, I
always assume that as many groups as possible have the 
right description (Fig. 2). 
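
The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for Table 1. Compatibility is reduced to a hypothetical label that two groups share exactly when their descriptions are compatible, a vague or absent assignment is represented by None and is not penalized, and the largest set of agreeing groups is taken to be right. The function names, the labels, and the pooling over annotations in min_error_rate are all assumptions made for the example.

```python
from collections import Counter


def min_errors(labels):
    """Return (annotations_counted, minimum_errors) for one gene.

    `labels` holds one entry per group. None marks a vague or absent
    assignment, which is not penalized. Any other value is a hypothetical
    compatibility key: two groups share a key exactly when their
    descriptions are compatible (a simplifying assumption).
    """
    assigned = [label for label in labels if label is not None]
    if not assigned:
        return 0, 0
    # Assume as many groups as possible are right: the largest set of
    # mutually compatible annotations is treated as correct.
    largest_agreeing = max(Counter(assigned).values())
    return len(assigned), len(assigned) - largest_agreeing


def min_error_rate(genes):
    """Pool the per-gene counts into a lower bound on the error rate.

    Genes annotated by fewer than two groups are skipped. Pooling over
    annotations (rather than over genes) is an assumption of this sketch.
    """
    total = errors = 0
    for labels in genes.values():
        counted, wrong = min_errors(labels)
        if counted >= 2:
            total += counted
            errors += wrong
    return errors / total if total else 0.0


# Hypothetical labels that mirror the patterns in the figure legend.
examples = {
    "two groups, incompatible": [None, "function A", "function B"],  # 1 of 2
    "two agree, one conflicts": ["pilB", "other", "pilB"],           # 1 of 3
    "all three contradictory": ["X", "Y", "Z"],                      # 2 of 3
    "all compatible": ["primase", "primase", "primase"],             # 0 of 3
}
for name, labels in examples.items():
    print(name, min_errors(labels))
print(min_error_rate(examples))
```

Because agreement is all the sketch can see, it inherits the limitation of the analysis itself: groups that agree on a wrong function are counted as correct, so the result is a lower bound on the error rate.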

The results are disappointing for those expecting reliable
annotation (Table 1). M. genitalium was reported to have
just 468 genes, many of which are fundamental for all life and
therefore easy to analyse. Nonetheless, the error rate is at
least 8% for the 340 genes annotated by two or three groups.
This value may not be uniform across the three groups, nor
does it reflect the overall significance of a group's results.
Genes annotated by only one group were not considered,
but include such improbable bacterial functions as B-cell
enhancing factor, mitochondrial polymerase, and serotonin
receptor. This analysis cannot detect those cases where
multiple groups arrived at consistent but wrong conclusions,
a likely occurrence because all relied on similar
methods and data. This evaluation also ignores minor
disagreements in annotation, and disparities in degree of
specificity (possibly indicating problematic overprediction
of function 4). Therefore, the true error rate must be
higher than this estimate.

There are several possible reasons why the functional
analyses have mistakes, as described at greater length
elsewhere 5-8. For example, it may be that the similarity
between the genomic query and database sequence is
insufficient to reliably detect homology, an issue solvable
by appropriate use of modern and accurate sequence
comparison procedures 9,10. A more difficult problem is accurate
inference of function from homology. Typical database
searching methods are valuable for finding evolutionarily
related proteins, but if there are only about 1000 major
superfamilies in nature 11,12, then most homologs must
have different molecular and cellular functions.

The annotation problem escalates dramatically beyond
the single genome, for genes with incorrect functions are
entered into public databases 8. Subsequent searches
against these databases then cause errors to propagate to
future functional assignments. The procedure need cycle
only a few times without corrections before the resources
that made computational function determination possible
(the annotation databases) are so polluted as to be
almost useless. To prevent errors from spreading out of
control, database curation by the scientific community
will be essential 4,13.

To ensure that databases are kept usable, the intent of a
gene annotation should be clear: does it indicate homology,
orthology, and/or functional equivalence? Fortunately, some
databases already incorporate this information explicitly
(e.g. Ref. 14). Errors will, of course, still creep in. To help
eliminate the collateral damage, computational assignments
should clearly be flagged as such, and they should
also indicate their source (which would allow propagation
of corrections) and a measure of confidence in their
accuracy. This will require new research and development in
algorithms and databases, and a broad commitment to
maintaining these resources. In short, the accessible
documentation needed for reproducibility of a computational
function determination should be commensurate with that
for a corresponding laboratory bench experiment.
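
As a rough illustration of what such a record might carry, the sketch below stores the kind of relationship asserted, a flag for computational assignments, the source entry, and a confidence value, and then uses the source links to list the annotations that would need review after a correction. Every name here (Relationship, Annotation, affected_by, the field names, and the 0 to 1 confidence scale) is hypothetical and does not describe the schema of any existing database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Relationship(Enum):
    """What the annotation is meant to assert (hypothetical categories)."""
    HOMOLOG = "homolog"
    ORTHOLOG = "ortholog"
    FUNCTIONAL_EQUIVALENCE = "functional equivalence"


@dataclass(frozen=True)
class Annotation:
    gene_id: str
    description: str
    relationship: Relationship
    computational: bool       # True if assigned without a bench experiment
    source: Optional[str]     # gene_id of the entry the function was copied from
    confidence: float         # assumed scale: 0.0 (none) to 1.0 (certain)


def affected_by(corrected_id: str, annotations: List[Annotation]) -> List[str]:
    """List computational annotations that trace back to a corrected entry."""
    derived = {}
    for ann in annotations:
        if ann.computational and ann.source is not None:
            derived.setdefault(ann.source, []).append(ann.gene_id)

    # Walk the source links outward from the corrected entry.
    affected, seen, stack = [], {corrected_id}, [corrected_id]
    while stack:
        current = stack.pop()
        for gene_id in derived.get(current, []):
            if gene_id not in seen:
                seen.add(gene_id)
                affected.append(gene_id)
                stack.append(gene_id)
    return sorted(affected)


# Made-up entries: B was inferred from A, C from B, and D from an entry X.
records = [
    Annotation("A", "DNA primase", Relationship.FUNCTIONAL_EQUIVALENCE, False, None, 0.95),
    Annotation("B", "DNA primase", Relationship.ORTHOLOG, True, "A", 0.80),
    Annotation("C", "DNA primase", Relationship.HOMOLOG, True, "B", 0.50),
    Annotation("D", "permease", Relationship.HOMOLOG, True, "X", 0.40),
]
print(affected_by("A", records))
```

Recording the source on every computational assignment is what makes the walk possible: without it, a corrected entry gives no way to find the assignments that were copied from it.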

An exception is when one group uses a gene name and another specifically notes that the current gene is a paralog and not identical (consider mg010). Where the descriptions from different groups were compatible, but of different levels of specificity, this was considered a correct assignment (e.g. mg225). The difficulty of [...] makes this analysis imprecise. Generally, the approach here is generous [...] is usually more permissive than Ref. 5.

mg463: Fraser et al. 1 and Koonin et al. 2 describe different aspects [...] the Ouzounis et al. 3 description is compatible with that from Koonin et al. 2, but less specific. All three annotations are [...] Fraser et al. 1 and Ouzounis et al. 3 agree that this is a DNA primase. Koonin et al. 2 use a different gene name and [...] of the common functional descriptions, all three are considered correct. However, if Koonin et al. 2 had been [...] their annotation would have been marked as conflicting. Note that [...] is also annotated as a DNA primase [...] annotation of histidine permease is more [...] prediction of function, or it could be [...]

(b) Inconsistent annotations. mg302: lack of a functional assignment from Fraser et al. 1 [...] are wholly inconsistent. This leads to a conflict and a minimum error rate of 50%. Note that the assessment also behaves correctly when two annotators provide different functions for a multi-functional enzyme: each of the annotators is half right and half wrong [...] assessment assigns a 50% error rate.

mg448: Fraser et al. 1 and Ouzounis et al. 3 both describe the gene as pilB. The encoded protein is involved in [...], and its biochemical function is catalysis of methionine sulfoxide oxidation/reduction in proteins. The Koonin et al. 2 annotation, chaperone-like [...] conceivably be compatible but this is not likely. Because of uncertainty regarding compatibility of the Koonin et al. 2 annotation and its qualification [...] threshold of consideration. For this analysis, the Koonin et al. 2 annotation was considered to be in conflict with the others, giving a minimum error rate of 33%.

mg085: all three groups provide contradictory functions. The function described by Fraser et al. 1 of HMG-CoA reductase is EC 1.1.1.34, while the NADH-ubiquinone oxidoreductase annotated by Ouzounis et al. 3 (nu6m_marpo) is EC 1.6.5.3. Neither enzyme uses ATP or GTP, [...] specified by Koonin et al. 2 The analysis assumes one is correct and marks two incorrect. Note: Ouzounis et al. 3 annotations equivalent to SWISS-PROT included in [...] are not included in the Table 1 analysis.